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Abstract. The real symmetric tridiagonal eigenproblem is of outstanding importance in numerical computations; 
it arises frequently as part of eigensolvers for standard and generalized dense Hermitian eigenproblems that are based 
on a reduction to tridiagonal form. For its solution, the algorithm of Multiple Relatively Robust Representations 
(MRRR) is among the fastest methods. Although fast, the solvers based on MRRR do not deliver the same accuracy 
as competing methods like Divide & Conquer or the QR algorithm. In this paper, we demonstrate that the use of 
mixed precisions leads to improved accuracy of MRRR-based eigensolvers with limited or no performance penalty. 
As a result, we obtain eigensolvers that are not only equally or more accurate than the best available methods, but 
also — under most circumstances - faster and more scalable than the competition. 
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1. Introduction. In 21j, the authors describe how in libraries the use of "higher internal 
precision and mixed input/output types and precisions permits [...] to implement some algorithms 
that are simpler, more accurate, and sometimes faster." In particular, the internal use of higher 
precision provides the library developer with extra precision and a wider range of values, which 
may benefit the accuracy and robustness of numerical routines. In sharp contrast to software that 
uses arbitrary precision to obtain any desired accuracy, the use of higher precision should not lower 
performance significantly if at all. In this paper, we employ mixed precisions to improve not only the 
accuracy, but also the robustness and scalability of eigensolvers based on the algorithm of Multiple 
Relatively Robust Representations (MRRR or MR'' for short) [12, 3, 36, 31, 32, 37 . 

Direct methods for generalized and standard Hermitian eigenproblems often rely on a reduction 
to real symmetric tridiagonal form fP . Once the problem is transformed, the real symmetric tridi- 
agonal eigenproblem (STEP) is the following: Given a tridiagonal matrix T € M"^" (with T = T*, 
where T* denotes the transpose of T), find quantities A G M and nonzero z £ M" such that the 
equation 

Tz^ Xz 

holds. Without loss of generality, we assume ||z|| = 1 hereafter, where ||»|| denotes the 2-norm. 
For such a solution, A is called an eigenvalue (of T) and z an associated eigenvector; an eigenvalue 
together with an associated eigenvector are said to form an eigenpair, (A, z). The Spectral Theo- 
rem [24] ensures the existence of n eigenpairs (Ai, Zi), i e {1, 2, . . . , n}, such that the eigenvectors 
form a complete orthonormal set; that is, for all i, j e {1, 2, . . . , n}, 

1 iij = i, 
iij^i. 
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Since all eigenvalues are real, they can be ordered: 

Ai < A2 < . . . < A, < . . . < A„ , 

where Xi is the i-th smallest eigenvalue of T. In this paper, whenever the underlying matrix is not 
clear, we write Xi[T] explicitly. The set of all eigenvalues of T is denoted spec[T] and the spectral 
diameter defined as spdiam[T] — A„ — Ai. For a given index set X C {1, 2, . . . , n}, 

Zx ~ spanjzi : i £ X} 

denotes the invariant subspace associated with I. As with the eigenvalues, whenever the underlying 
matrix is not understood from context, we write 2x[T] explicitly. 

In many applications, only a subset of eigenpairs are of interest and need to be computed. For 
the computed eigenpairs, (Ai,Zj), ||zj|| = 1 and i Gl Q {1, 2, . . . ,n}, the accuracy of the results 
can be quantified by the largest residual norm and the orthogonality, we respectively defined as 

it = max 7——, and C = max max z, Zi . (l-l) 

»ei ||T||i iei jei ^ ^ 

A number of excellent algorithms for the STEP have been discovered. Among them. Bisection 
and Inverse Iteration (BI) ^^, the QR algorithm (QR) ^^, Divide & Conquer (DC) gUISlin], 
and the focus of this study, MRRR [3 [H HI [H HZl [37] . These methods differ in various aspects: 
the number of floating point operations (flops) they perform, the flop-rate at which the operations 
are executed, the amount of memory required, the possibility of computing subsets of eigenpairs at 
reduced cost, the attainable accuracy, the simplicity and robustness of the code, and the suitability 
for parallel computations. Thus, the "best" algorithm is influenced by factors such as the problem 
(e.g., dimension, subset, spectral distribution), the architecture (e.g., cache sizes, parallelism), 
external libraries (e.g., Basic Linear Algebra Subprograms), and the specific implementation of the 
algorithm (e.g., thresholds, optimizations). 

Demmel et al. [6] provide a detailed study of the performance and accuracy of LAPACK's [1] 
implementations of these four methods on various architectures. They conclude that (i) DC and 
QR are the most accurate algorithms; (m) DC requires 0{n'^) additional memory and therefore 
much more than all the other algorithmsjj (Hi) DC and MRRR are much faster than QR and BI; 
despite the fact that MRRR uses the fewest flops, DC is faster on certain classes of matrices. If 
the full eigendecomposition is desired, DC is generally the method of choice, but whether DC or 
MRRR is faster depends on the spectral distribution of the input matrix; and (iv) if only a subset of 
eigenpairs is desired, MRRR is the method of choice. The study is limited to sequential executions 
and does not take into account the degree of parallelism the algorithms provide. However, various 
studies O [36l [32l [SH [31] of the performance and accuracy of parallel implementations come to 
similar conclusions. 

To summarize, if all eigenpairs are computed, depending on the spectral distribution of the 
input matrix, either DC or MRRR is the fastest method. If only a subset of eigenpairs is desired, 
MRRR is the method of choice. Unfortunately, MRRR delivers generally the least accurate results. 
These observations carry over to direct methods for the dense eigenproblem based on a reduction 



^Here and in the following, we use the notation 0{x) informally as "of the order of x in magnitude." The notion 
is used to hide moderate constants that are of no particular interest to our discussion. 
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to tridiagonal form. It is natural to ask whether the accuracy of the MRRR-based routines can 
be improved to levels of other methods like QR or DC. Unfortunately, a general analysis of any 
MRRR implementation shows that, even if all requirements of the algorithm are fulfilled, one needs 
to expect orthogonality of ©(lOOOne) - with unit roundoff e [S^ . Methods like QR and DC, however, 
attain superior results with orthogonality of 0{£^/n). In this paper, we present a practical solution 
that improves the accuracy of MRRR. As a result, it becomes equally or more accurate than QR 
and DC. 

Our solution resorts to the use of higher precision arithmetic. The motivation is twofold: (i) 
MRRR is frequently the fastest algorithm and it might be possible to trade (some of) its performance 
to obtain higher accuracy; {ii) often MRRR is used in the context of direct methods for Hermitian 
eigenproblems. While the tridiagonal stage is responsible for much of the "loss" of orthogonality in 
the final result, it has a lower complexity than the reduction to tridiagonal form. Thus, even if it 
is necessary to spend more time in the tridiagonal stage to improve accuracy, for sufficiently large 
matrices, the overall run time will not be affected significantly. As MRRR does not make use of any 
level-3 Basic Linear Algebra Subprograms (BLAS) [13] . we do not require in our mixed precision 
approach any optimized BLAS library for high precision, which might not be available. 

For any MRRR solver, we present how the use of mixed precisions leads to more accurate 
results at very little or even no extra costs in terms of performance. As a consequence, MRRR is 
not only one of the fastest methods, but also becomes as accurate or even more accurate than the 
competition. Moreover, for direct methods based on a reduction to tridiagonal form and MRRR, the 
tridiagonal eigensolver is responsible for the inferior orthogonality compared with other methods. 
These solvers benefit directly from our approach. 

1.1. Other related work and outline. The term mixed precision algorithm is sometimes 
synonymously used for the following procedure: First solve the problem using a fast low-precision 
arithmetic, and then refine the result to high accuracy using a high-precision arithmetic. This mixed 
precision iterative refinement approach exploits the fact that there might exist a low-precision 
arithmetic faster than that of the input/output data format. The larger the performance gap 
between the two arithmetic, the more beneficial the approach. Iterative refinement (with and 
without using mixed precisions) has been most extensively studied for the solution of linear systems 
of equations |18j , but other operations such as the solution of Lyapunov equations also benefit from 
itE). 

We use of the term mixed precision in its more general form; that is, using two or more different 
precisions for solving a problem. In particular, we use a higher precision in the more sensitive parts 
of an algorithm to obtain accuracy, which otherwise could not be achieved. This approach is 
especially effective if the sensitive portion of the algorithm and/or the performance gap between 
the two arithmetic is small. Similarly to the mixed precision ideas for iterative refinement, the 
approach is quite general and we believe it can benefit computations in numerous areas. 

The rest of the paper is organized as follows: In Section [21 we present the MRRR algorithm 
and its accuracy limitations. The section mainly serves as a vehicle to introduce the factors that 
influence accuracy, which are summarized in Theorem 12.11 The derivation of the error bounds, 
which can be found in 38:, is not important for the understanding of our discussion. In Section [S] 
we detail our mixed precision approach in a general setting. Besides presenting a way to improve 
accuracy, we investigate the effects on memory usage, robustness, and scalability. In Section |4l 
we comment on an actual implementation of our approach and elaborate on a number of practical 
issues. Finally, we present experimental results of our mixed precision solvers in Section [5j 
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2. The MRRR algorithm. In this section, we present the MRRR algorithm to the detail 
necessary for the later discussion. Our exposition is largely based on [38l |40l |9] and a high-level 
description of the method given in Algorithm [TJ We comment more thoroughly on various parts 
of the computation in the following. The goal is to present Theorem 12.11 - i.e., the factors that 
influence the accuracy of any implementation of MRRR. 

Preprocessing. Algorithm [1] assumes that the necessary preprocessing is already performed; 
this includes the scaling of the entries, and the so called splitting of the input matrix into principal 
submatrices if off-diagonal entries are sufficiently small in magnitude |24| . In this section, without 
any loss is generality, we assume that the input matrix is (numerically) irreducible, i.e., no off- 
diagonal entry is "small enough" in magnitude that warrants setting it to zero. The exact criterion 
is specified later. 

Algorithm 1 MRRR 

Input: Irreducible symmetric tridiagonal T £ R"^"; index set Ii„ C {1, . . . , n}. 
Output: Eigenpairs (Ai, Zi) with i £ Tin- 

1: Select shift /x £ R and compute Mroot = T — fil. 

2: Perturb Mroot by a "random" relative amount bounded by a small multiple of e. 

3: Compute Xi [Mroot] with i £ Ti„ to relative accuracy sufficient for classification. 

4: Form a work queue Q and enqueue task {Mroot,1in-,lJ'}- 

5: while Q not empty do 

6: Dequeue a task {M, I, a}. 

7: Partition I = Ur=i-^'' according to the separation of the eigenvalues. 

8: for r = 1 to R do 

9: if Ir = {i} then 

10: // process well-separated eigenvalue associated with singleton Xr // 

11: Perform Rayleigh quotient iteration (guarded by bisection) to obtain eigenpair (A,;[-^]i -Zj) 

with sufficiently small residual norm, \\Mzi — \i[M] Zi\\. 

12: Return \i[T] = \i[M] + a and Zi. 

13: else 

14: // process cluster associated with Xr / / 

15: Select shift r £ R and compute MaMfted ~ M — tI. 

16: Refine Xi [Mshtfted] with i £Xr to sufficient relative accuracy. 

17: Enqueue {Mshif ted, Xr, a + t}. 

18: end if 

19: end for 

20: end while 

Choice of representations. In order for Algorithm [T] to work, the representation of tridiagonals 
(i.e., Mroot and M shifted) by their diagonal and off-diagonal entries must be abandoned and alter- 
native representations must be used. Any 2n — 1 (or less) scalars together with a mapping that 
define the entries of a symmetric tridiagonal is called a representation [,38, . We distinguish between 
the data of the representation, which are floating point numbers, and the underlying tridiagonal, 
which is generally not exactly representable in the same finite precision format. There are multiple 
candidates - existence assumed - for providing representations of tridiagonals: 

1. Lower bidiagonal factorizations of the form T = LDL*, and upper bidiagonal factor- 
izations of the form T — UVtU* , where D — diag(di, ^2, . . . ,d„) G M"^" and fi = 
diag(a;i,a;2, . . . ,a;„) G R"^" are diagonal, L £ M"^" and U £ K"^" are respectively unit 
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lower bidiagonal and unit upper bidiagonal. 

2. A generalization of the above are the so called twisted factorizations or BABE- 
factorizations 1261 . T = NkA.kN^, where k denotes the twist index. The k x k leading 
principle submatrix of Nk G R"^" is unit lower bidiagonal (determined by the non-trivial 
entries £i, . . . ,£k-i), and the (n — fc + 1) x (n — fc + 1) trailing principle submatrix of Nk 
is unit upper bidiagonal (determined by the non-trivial entries Uk,- ■ ■ ,Un-i)] the matrix 
Afc == diag(di, . . . ,dk-i,jk,^k+i, ■ ■ ■ ,w„) € R"^" is diagonal. Although it was known that 
these factorizations can additionally serve as representations of the intermediate matri- 
ces [3 [To] , their benefits were only demonstrated recently [37l [40] . Due to their additional 
degree of freedom in choosing k, the twisted factorizations are superior to lower or upper 
bidiagonal factorizations. Besides representing intermediate tridiagonals, twisted factoriza- 
tions are essential in computing accurate eigenvectors [26l [10] . 

3. Blocked factorizations are further generalizations of bidiagonal and twisted factorizations. 
The quantities D, ft, and A^ are block diagonal, with blocks of size 1 x 1 or 2 x 2. The 
other factors - L, U, and Nk - are partitioned conformally, with one or the 2x2 identity 
as diagonals. These types of factorizations contain the unblocked bidiagonal and twisted 
factorizations as special cases. With their great flexibility, the blocked factorizations have 
been used very successfully within the MRRR algorithm [37l [39] . 

All these factorizations are determined by 2n— 1 scalars, the data. For instance, for lower bidiagonal 
factorizations, the 2n— 1 floating point numbers di, . . . , d„, £i, . . . , £n~i determine a tridiagonal; such 
representation by the non-trivial entries of the factorization is called an iV-representation. Similarly, 
the floating point numbers di, . . . , (i„, ei, . . . , e„_i represent a tridiagonal - with e^ = diii, 1 < i < 
n — 1, being T's off-diagonal elements; such representation, including T's off-diagonal elements, 
is called an e-representation. Besides the N- and the e-representation, the Z-representation is 
introduced in [JD] for bidiagonal and twisted factorizations. For blocked factorizations, a variant of 
the e-representation is commonly used [39] . Other quantities that are computed using the (primary) 
data are called secondary or derived data. For instance, T's off-diagonal elements are secondary 
for an iV-representation while being primary for an e-representation. While the details are not 
relevant for our discussion, it is important to note that there are different variants to represent 
tridiagonals - each one with slightly different properties. Subsequently, we do not distinguish 
between the representation of a tridiagonal and the tridiagonal itself; that is, it is always implied 
that tridiagonals are represented in one of the above forms. 

The representation tree. The unfolding of Algorithm [1] is best described as a tree of representa- 
tions J, ,9) .38i . Each task {M, X, a} or just {M,I} (Line [6] of Algorithm [ij is connect to a node in 
the tree; that is, all nodes consist of a representation and an index set. {Mroot,^in} is associated 
with the root node (hence the name). All other tasks are connected to ordinary nodes. Each node 
has a depth associated with it: the number of edges on the unique path to it from the root. The 
maximum depth for all nodes is denoted dmax ■ The edges connecting internal nodes are associated 
with the spectrum shifts r that are performed in Line [15] of Algorithm [TJ 

Factors influencing MRRR 's accuracy. The analysis in j38j - a streamlined version of the proofs 
in [HI [in] ~ shows that, if suitable representations are found, the computed eigenpairs enjoy a small 
residual norm and are mutually (numerically) orthogonal. 

Theorem 2.1 (Accuracy). Let Xi[Mroot] be computed (exactly Jq by applying the spectrum shifts 
to eigenvalue Xi[M] obtained by the Rayleigh quotient iteration in Line \ll\ of AlaorithmUl Provided 



^The assumption can be removed; we simply stated the theorem as it can be found in 1371 138 | 
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all the requirements, which are discussed below, are satisfied, it holds 

\\Mrootz^~\[Mroot]z^\\ < {\\r'-'°''''^ \\ + J spdiam[Mroot]) ^-^ 

with 7 = keigU {dmaxi^i + Ct) + ct) + 2(dmax + 1)?^- Furthermore, we have for any computed eigen- 
vectors Zi and Zj, i ^ j, 



gaptol 

where TZne — k^rna / gaptol + krsne/ gaptol + 77. A proof of the theorem can be found in |37[ 138) . 
Provided the representation of Mroot is computed in a backward stable manner, a small residual 
norm with respect to Mroot implies a small residual, 0(ne||r||), with respect to the input matrix 
T. 

The rest of this section serves to convey the meaning of all the parameters involved in the 
theorem. In Scction[31 we furthermore discuss their effects on performance, robustness, and parallel 
scalability. 

Shifting the spectrum (S,^tS,\)- The spectrum shifts of Line [15] leave the eigenvectors unchanged 
in exact arithmetic; this invariance is lost in finite precision. An essential ingredient of MRRR 
is the use of special forms of Rutishauser's Quotienten-Differenzen (qd) algorithm to perform the 
shifts. Given a representation for M, we require that the representation for Mghifted — M — tI is 
computed in an element-wise mixed relative stable way, i.e., M shifted = M — tI holds exactly for 
small element- wise relative perturbations of the data for M shifted and M . For all shifts performed 
in the algorithm, these perturbations must be bounded by ^f = 0{e) and ^4. — 0(e), respectively. 
Algorithms that implement the spectrum shifts for different forms of representations are presented 

in [lomniisn]- 

Requirements on the representations (krr,keig). In order to ensure that the computed eigen- 
pairs enjoy small residual norms with respect to the input matrix and that the eigenvectors are 
numerically orthogonal, the representations in Line 1151 Mshifted = M — tI, need to be chosen 
with care. By selecting appropriate shifts r, representations that are relatively robust and exhibit 
conditional element growth are selected. Before we define the meaning of these two concepts, we 
give a brief definition of a relative gap. 

Definition 2.2 (Relative gap). Let T G K"^" be an irreducible symmetric tridiagonal matrix 
with eigenvalues {Xi : 1 < i < n} and let X C {l,2,...,n} be an index set. The relative gap 
connected to X is defined as 

relgap(X) = min I ^ ' ■.ieX,j(^X 

where quantities \Xj — Xi\/\Xi\ are 00 if Xi — 0. 

Definition 2.3 (Relative robustness). Let T (given by any representation), Xi andX he as in 
the previous definition. Furthermore, let Zx be the invariant subspace associated with X. We say 
that the representation of T is relatively robust for X if for all element-wise relative perturbation in 
the data bounded by ^ ^ 1 and for all i (z X, we have 

|Aj - Ail < krrn^\Xi\ , 

sinZ(ii,Zi) < /'' , , , 
relgap(X) 
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where Xi and Zj denote the eigenvalues and the corresponding invariant subspaces of the perturbed 
matrices, respectively; Z(Zx,2x) denotes the largest principle angle; and k^r is moderate constant, 
say about 10^ 

Definition 2.4 (Conditional element growth). A representation for a real symmetric tridiag- 
onal, M , exhibits conditional element growth with respect to the index set X C {1, . . . , n}, if for any 
element-wise relative perturbation in the data bounded by ^ <^ 1 (leading to a perturbed tridiagonal 
M) and each i €l, it holds 

||Af — Af II < spdiam[Afrooi] , and 
\\{M - Af)zJ < keigU^ ■ spdia.rR[Mroot] , 

where Zi are the computed eigenvectors, and keig is a moderate constant, say about 10. 

In Line [15] of Algorithm [l] we need to ensure that M shifted as well as M are relatively robust 
for Ir and that Mshifted features conditional element growth for Xr- In this paper, we are not 
concerned how to ensure that the involved representations satisfy the requirements; this is the topic 
of [25J [571 [ini [3H] O We remark however that there exist the danger that no suitable representation 
that passes the test for the requirements can be found. In this case, it is common to select a 
promising representation, which might not fulfill the requirements. As a consequence, the accuracy 
of Theorem 12. II is not guaranteed anymore. 

Classification of the eigenvalues (gaptol). While the above requirements on the representations 
pose a restriction on the choice of shifts p, and r in Lines [T] and [TSj the main goal is to chose shifts 
such that, in the next iteration, the partitioning X = \_}^^iXr, splits the index set into at least two 
subsets so that progress in the algorithm is guaranteed. The partitioning is done according to the 
separation of the eigenvalues and must ensure two requirements: For a given tolerance gaptol, say 
10""^, (i) relgapilr) > gaptol , and (ii) whenever Z^ — {i} is a singleton, relgap{Xi) > gaptol. The 
latter relative gap is thereby defined as 

relgap{Xi) = — " with gap{Xi) = min \ \Xi - Xj\\ . 
I Ail j#» I. J 

For all z G I, let A^ denote the midpoint point of a computed interval of uncertainty [Aj, A^] 
containing eigenvalue Xi . To achieve the desired partitioning of X, let j, j + 1 G I and define 



reldist{j,j + 1) = 



Aj+i - Aj 



max{|Aj-|,|Aj|,|Aj-+i|,|Aj+i|} 



as a measure of the relative gap. If reldist{j,j + 1) > gaptol, then j and j + 1 belong to different 
subsets of the partition. Additionally, this criterion based on the relative separation can be amended 
by a criterion based on the absolute separation of the eigenvalues [35) . After partitioning, each index 
setir with \Tr\ > 1 is associated with a cluster of eigenvalues, {Ai : i S 2^.}. Similarly, each singleton 
Xr = {i} is associated with a well- separated eigenvalue Xi. 



''According to [371138) . the requirement on the eigenvalues can be removed entirely; for X = {i}, by Theorem 1 2. 5 1 
stated below, the second condition implies the first up to a small constant provided gapiXi) Ri gap{{i\). Similarly, 
if I = {p, ...,(?} is not a singleton, the second term implies that A^ £ [Ap — krrn£\\p\, Xq — krrn^\Xq\] for all i £ I. 

■'In particular, it is not necessary to compute the eigenvectors in order to give bounds on the conditional element 
growth. 
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In order to reliably classify the eigenvalues, they should be approximated in Lines [51 and [TBI to 
relative accuracy of about gaptol: that is, at least 

\k ~ Xt\ < gaptol ■ \X^\ . (2.1) 

The above criterion can be relaxed for eigenvalues with a large gap to the rest of the spectrum [12] . 
Commonly, the eigenvalues are computed by some form of bisection |24j . In particular, in Line 1161 
we already have good approximations to the eigenvalues, which can be refined by bisection to the 
desired accuracy. 

The parameter gaptol is so important that it influences almost all parts of the algorithm. 
Since the error bounds in Theorem 12.11 are proportional to 1/ gaptol, the value indicates how much 
accuracy we are willing to lose in the computation. For many applications, this limits the choice to 
values larger than about 10"'^. However, we cannot use values much larger than 10"'^ as otherwise 
it becomes impossible to make progress by breaking clusters. 

As a side note: the condition relgapiTr) > gaptol, together with the mixed relatively stable 
computation of the spectrum shifts, implies that the associated invariant subspaces are not per- 
turbed too much due to rounding errors, i.e., sin Z{Zx^[Mshif ted], 2x^[M]) < krrn{£,i + ^^)/ gaptol. 
After shifting, we can therefore hope to compute an orthonormal basis for such a subspace, which 
is automatically numerically orthogonal to the subspace spanned by the other eigenvectors. This is 
the main idea behind the MRRR algorithm. 

Rayleigh quotient iteration (krs,OL,rj). Finally, in Line 1111 of Algorithm [Tl eigenpairs of well- 
separated eigenvalues are computed via the Rayleigh quotient iteration (RQI). Given an approxima- 
tion Ai[M] and a representation M that is relatively robust for {«}, a key ingredient of MRRR is the 
ability to compute an accurate eigenvector approximation Zi such that sinZ(zi, Zi) — 0{ne/ gaptol); 
see [10] for a proof. This is certainly achieved by driving the local residual norm below a specified 
threshold 

||^(iocaO|| ^ \\Mz, - A,[M] z,|| < krs ■ gap (a,[M]) • ^-^ , (2.2) 

where krs is 0(1). In this case, the so called Gap Theorem gives the desired bound on the error 
angle Z(zj,Zi). 

Theorem 2.5 (Gap Theorem). Given a symmetric matrix T g R"^" and an approximation 
{X,z), \\z\\ — 1, to the eigenpair (A, z), with A closer to A than to any other eigenvalue, let r be the 
residual Tz — \z; then 

sinZ(£,z)<^lL. (2.3) 

gap[X) 

The residual norm is minimized if X is the Rayleigh quotient of z, A = z*Tz. In this case, 



IkIP ] 



,'. " .^1 <sinZ(z,z) and |A - A| < min <^ ||r||, ^^ ^ . (2.4) 

spdtam,[T\ \^ gap{X) J 

A proof of the theorem can be found for instance in [531 131] ■ 

In general, a residual norm such as in (J2.2I) cannot be guaranteed; it is only possible to show 
that it holds for a small element-wise relative perturbation of the data of M bounded by a and 
the computed eigenvector Zi bounded by rj - with a = ©(e) and rj = 0{ne). For our purposes, 
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this detail is not important. Nonetheless, Theorem 12.11 takes this fact into account. Note that 
even in rare cases where (|2.2[) is not fulfilled, the small error angle together with (|2.4p imply 
||^(/oca/)|| ^ 0{spdiam[M] ■ ne/gaptol). 

In the RQI, the j-th iteration consists of four steps: (i) For all 1 < fc < n, compute the 
twisted factorizations Nk^kNk — M — A^ I, (m) determine s = argmiuj, |7fc|, where jk is the fc-th 
element of Ak (see above); (in) solve the linear system NgAgN* z^' — ^s^s, which is equivalent to 
the system N* z^' = eg] (iv) use the Rayleigh quotient correction term to update the eigenvalue 

^i — ^i + 7s/Pi IP- The residual norm is approximated by |7s|/Pi || and the process 

is stopped if (j2.2p is satisfied. In order to always converge, the stopping criterion is amended 



and the iteration stopped when Ai is not improved anymore, i.e., |7s|/p, IP — 0{s\Xi\) :22;. 
An alternative approach to RQI is to refine the eigenvalue approximation to full precision (i.e., 
|Ai — Ail = 0{ns\Xi\)), and then perform only a single step of RQI. This approach is used whenever 
RQI fails to converge to the correct eigenvalue [12] . 

3. A mixed precision approach for MRRR. The exact values of the parameters in The- 
orem 12.11 differ slightly for various implementations of the algorithm and need not to be known 
exactly in the following analysis. The bounds on the residual norm and orthogonality are theoreti- 
cal. It is useful to translate what the bounds mean in practice: with reasonable parameters, realistic 
practical bounds on the residual norm and on the orthogonality are ne and lOOOne, respectively. 
In order to obtain accuracy similar to that of the best available methods, we need to trade the 
dependence on n by a dependence on ^/n. Furthermore, it is necessary to reduce the orthogonality 
by about three orders of magnitude. 

3.1. A solver using mixed precisions. The technique is simple, yet powerful: Inside the 
algorithm, we use a precision higher than of the input/output in order to improve accuracy. A 
similar idea was already mentioned in [7] , in relation to a preliminary version of the MRRR algo- 
rithm, but was never pursued further. With many implementation and algorithmic advances since 
then (e.g., [HI [TTJ |3J UHl I3S] ) : it is appropriate to investigate the approach in detail. To this end, 
we build a tridiagonal eigensolver that differentiates between two precisions: (i) the input/output 
precision, say binary_x, and (ii) the working precision, binary_y, with y > x. li y = x, we have 
the original situation of a solver based on one precision; in this case, the following analysis is easily 
adapted to situations in which we are satisfied with less accuracy than achievable by MRRR in x-bit 
arithmetic. Since we are interested in accuracy that cannot be accomplished in a;-bit arithmetic, 
we restrict ourselves to the case y > x. Provided the unit roundoff of the y-hit format is sufficiently 
smaller than the unit roundoff of the a:-bit format, say four or five orders of magnitude, we show 
how to obtain, for practical matrix sizes, improved accuracy to the desired level. 

Although any z-bit and y-bit floating point format might be chosen, in practice, only those 
shown in Table [5TT] are used in high-performance libraries. For example, for a binary32 input/output 
format (single precision), we might use a binary64 working format (double precision). Similarly, 
for a binary64 input/output format, we might use a binarySO or binaryl28 working format (ex- 
tended or quadruple precision). For these three configurations, we use the terms single/ double, 
double/ extended, and double /quadruple. Practical issues for their implementation are discussed in 
Section 131 In this section, however, we concentrate on the generic case of binary_x/binary_y. In 
general, when we refer to binary _x, we mean both the x-bit data type and its unit roundoff Ex. 

In principle, we could perform the entire computation in y-hit arithmetic and, at the end, cast 
the results to form the x-bit output; for all practical purposes, we would obtain improved results 
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Name 



IEEE-754 Precision Support 



single 


binary32 


Ss -■ 


^ 2-24 


Hardware 


double 


binary64 


Sd ^ 


'2~^^ 


Hardware 


extended 


binarySO 




^2-64 


Hardware 


quadruple 


binaryl28 


t-q - 


= 2-"3 


Software 



Table 3.1 
The various floating point formats used and their support on common hardware. The e-terms denote the unit 
roundoff error (for rounding to nearest). We use the letters s, d, e and q synonymously with 32, 64, 80, and 128. 
For instance, e,, = £3. 



as desired. This naive approach, however, is not satisfactory for two reasons: (i) the memory 
requirement is increased, since the eigenvectors need to be stored explicitly in the binary_y format; 
and more importantly, (ii) if the y-bit floating point arithmetic is much slower than the a;-bit 
one, the performance suffers severely. While the first issue is addressed rather easily (as discussed 
Section l3.3p . the latter requires more care. The key insight is that it is unnecessary to compute 
eigenpairs with residual norms and orthogonality bounded by 0{ney); instead, these bounds are 



relaxed to 0{£x\fn) (for example, think of e^ 



10- 



10 



-34 



and 



1,000). While in a 



conventional implementation the choice of parameters is very restricted, as we show below, we gain 
enormous freedom in their choice. In particular, while meeting our new accuracy goals, we select 
values such that the amount of necessary computation is reduced, the robustness is increased, and 
parallelism is improved. As our following analysis shows, we can emphasize the importance of any 
of those features. 

3.2. Adjusting the algorithm. Consider the input/output being in a a;-bit format and 
the entire computation being performed in y-bit arithmetic. Starting from this configuration, we 
expose the new freedom in the choice of several parameters and justify other changes made to the 
algorithm. For example, we identify parts that can be executed in a;-bit arithmetic, which might 
be considerably faster. 

Assuming £y <C £x (again, think of e^ ~ 10^^^ and £y ~ lO"^**), we simplify Theorem 12.11 bv 
canceling terms that are insignificant even with adjusted parameters (i.e., terms that are comparable 
to £y in magnitudqj). In our argumentation, we hide all constants, which anyway correspond to 
the bounds attainable for a solver purely based on hinary_y. For any reasonable implementation of 
the algorithm, we have the following: a ~ 0{ey), -q ~ 0{ney), ^4, = 0{ey), ^f = 0{ey). Thus, the 
orthogonality of the final result is given by 



'^'■"^l = ^('-^ 



nsy 



^rr ^7nax 



.] 



(3.1) 

(3.2) 

with ||r('°=''')|l < krs gap (A»[Af]) ^^ and 7 = Oik^ig d^ax nSy). 

We now provide a list of changes that can be done to the algorithm. We discuss their effects 
on performance, parallelism, and memory requirement. 



gaptol J 
Similarly, for the bound on the residual norm, we get 

\\MrootZ^ - k[Mroot] ^ - O (\\A'°^^''> \\ + J Spdiam[Mroot] 



^In particular, we require that nsy < ex\/n. 
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Preprocessing. We assume scaling and splitting is done as in a solver purely based on x-bit 
floating point arithmetic. In particular, off-diagonal element e^ of the input, 1 < i < n — 1, is set 
to zero whenever 

\e^\<e,\\T\\, 

where n and T refer to the unreduced input |f| We remark that this criterion is less strict than setting 
elements to zero whenever \ei\ < ey|jr||. Splitting the input matrix into submatrices is beneficial 
for both performance and accuracy as these are mainly determined by the largest submatrix. In 
the rest of this section, we assume that the preprocessing has been done and each subproblem is 
treated independently by invoking Algorithm [T] In particular, whenever we refer to matrix T, it 
is assumed to be irreducible; whenever we reference the matrix size n in the context of parameter 
settings, it refers to the size of the processed block. 

Choice of representations. For difi'erent forms of representing tridiagonals (e.g., bidiagonal, 
twisted, or blocked factorizations) and their data (e.g., N-, e-, or Z-representation), different algo- 
rithms implement the shift operation: Mghifted = M — tI. All these algorithms are stable in the 
sense that the relation holds exactly if the data for Mshifted and M are perturbed element-wise 
by a relative amount bounded by 0{ey). The implied constants for the perturbation bounds vary 
slightly. As Sy <C £x, instead of concentrating on accuracy issues, we make our choice based on 
robustness and performance. A discussion of performance issues related to different forms of the 
representations can be found in [301 ^\ . Based on this discussion, it appears that twisted factor- 
izations with e-representation seem to be a reasonable choice. As the off-diagonal entries of all 
the matrices stay the same, they only need to be stored once and are reused during the entire 
computation. 

Random perturbations. In Line [2] of Algorithm (H to break up tight clusters.the data of Mroot, 
{xi, ...,X2n-i}, is perturbed element-wise by small random relative amountsij Xi = Xi{l + ^i) 
with l^il < ^ for all 1 < « < 2n— 1. In practice, a value like ^ = Se is used. Although our 
data is in binary_y, we are quite aggressive and adopt ^ = e^; or a small multiple of it. Thus, for 
y = 2x, about half of the digits in each entry of the representation are chosen randomly; therefore, 
with high probability, eigenvalues do not agree to more than about — logj^g ^x digits. This has 
two major effects: (i) together with the changes in gaptol (see below), in practice, the probability 
to encounter dmax > 1 becomes very low, and (ii) it becomes easier to find suitable shifts such 
that the resulting representation satisfies the requirements of relative robustness and conditional 
element growth. The positive impact of small dmax on the accuracy is apparent from (|3.1|) and 
p.2|) . Furthermore, as discussed below, due to limiting dmax, the computation can be reorganized 
for efficiency. Although it might look innocent, the more aggressive random perturbations lead to 
much improved robustness: A detailed discussion can be found in [TT|Pl 

Classification of the eigenvalues. Due to the importance of the ^apioZ-parameter, adjusting it 
to our requirements is key to the success of our approach. The parameter infiuences nearly all 
stages of the algorithm; most importantly, the classification of eigenvalues into well-separated and 
clustered. As already discussed, the choice of gaptol is restricted by the loss of orthogonality that 
we are willing to accept; in practice, the value is often chosen to be 10"'^ [9 O As we merely require 
orthogonality of ex\/n, we accept more than three orders of magnitude loss of orthogonality. Both 



^We can relax the condition further and use \ei\ < eaj-v/nHTH. 

True randomness is not necessary; any (fixed) sequence of pseudo-random numbers can be used. 
*For a quantitative assessment of robustness, see 1291 . 
^For instance, LAPACK's DSTEMR uses 10"^, while SSTEMR uses 3 ■ 10"^. 
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terms in (13.11) (and the in practice observed orthogonality) are proportional to nSy/gaptol. This 
means that the value of gaptol can be chosen as small as ey^/n/ex- As a consequence, we might 
select any value satisfying 

mm <; 10"'\ ^^ I < gaptol < 10"^ , (3.3) 



where the 10~^ terms are derived from practice and might be altered slightly. Note that gaptol 
potentially becomes as small as 10~^^/n in the single/double case and 10~^^y/n in the dou- 
ble/quadruple one. If we restrict the analysis to matrices with size n < 10^, we can choose a 
constant gaptol as small as 10~^ and 10"^^ respectively for the single/double and double/quadruple 
cases. 

With any choice of gaptol complying (13. 3p . accuracy to the desired level is guaranteed, and there 
is room to choose the specific value of gaptol, as well as other parameters, to optimize performance 
or parallelism. In particular, by generally reducing the clustering of the eigenvalues, the smallest 
possible value oi gaptol provides the greatest parallelism. To quantify this statement, for any matrix, 
we define clustering p e [l/?^, 1] formally as the size of the largest cluster divided by the matrix 
size. There are two main advantages in decreasing p: (i) the work is reduced as processing the 
largest cluster introduces 0{pn^) flops extra work, and (ii) the potential parallelism is increased. 
A conservative estimate of the parallelism of a problem is provided by p~^. For instance, p = 1/n 
implies that the problem is embarrassingly parallel. The estimate of parallelism assumes that 
clusters are processed sequentially, while in reality the bulk of the work (the refinement of the 
eigenvalues and the final computation of eigenpairs) can be parallelized. Nonetheless, matrices 
with high clustering still pose difficulties to MRRR as they introduce load-balancing issues and 
communication, which considerably reduce the parallel scalability [36l [35l [29j. Therefore, even 
if we did not have the desire to guarantee improved accuracy of the method, we could use the 
mixed precision approach to significantly enhance parallelism. In this case, the y^-dependence on 
the lower bound for the value of gaptol would be removed and the bound could be loosened by 
another three orders of magnitude; that is, we could choose a value of 10^^^ and 10^^^ for the 
single/double and double/quadruple case, respectively!^ Consequently, almost all computations 
become embarrassingly parallel. 

As an example. Table [3^ shows the clustering for double precision Hermite typqlj test matrices 
of various sizes with four distinct classification criteria|13 (I) gaptol = 10~^, (II) gaptol = 10~^, 
combined with splitting based on the absolute gap as proposed in |35| to enhance parallelism, 
(III) gaptol — 10^^°, and (IV) gaptol — 10^^^. For the latter two criteria, the computations are 
embarrassingly parallel. As with this example, experience shows that, thanks to a reduced value 
of gaptol as in criteria III or IV, many problems become embarrassingly parallel and guarantee 
improved accuracy. In case p = 1/n, dmax = 0, which not only benefits accuracy by p.l[) and 
p.2|) . but also has a more dramatic effect: the danger of not finding representations that satisfy the 
requirements is entirely removed. This follows from the fact that a satisfactory root representation 
is always found (e.g., by making T — pi definite) and no other representation needs to be computed. 



^"if we select values 10~^ and 10~^*, we improve the bounds by three orders of magnitude. 

^^See |23| for information on test matrices. 

^■^ Criterion I is used in LAPACK |12) and in results of mrSsnip in ;31J . which usually uses II. Criterion II is used 
in ScaLAPACK I36| and Elemental 1321 . In massively parallel computing environments, criteria III and IV can (and 
should) additionally complemented with the splitting based on absolute gaps; see also |33) . 
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Criterion 




Mati 


:ix size 






2,500 


5,000 


10,000 


20,000 


I 


0.70 


0.86 


0.93 


0.97 


II 


0.57 


0.73 


0.73 


0.73 


III 


4.00e-4 


2.00e-4 


l.OOe-4 


5.00e-5 


IV 


4.00e-4 


2.00e-4 


l.OOe-4 


5.00e-5 



Table 3.2 
The gaptol -parameter effect on clustering p S [1/n, 1] 



> 0, the number of times Line [15] of Algorithm [T] needs to be executed is 
often considerably reduced. 

On the downside, selecting a smaller gaptol can result in more work in the initial approxima- 
tiort^ and later refinements - in both cases, eigenvalues must be approximated to relative accuracy 
of about gaptol^ see (|2.ip : hence, optimal performance is often not achieved for the smallest possible 
value of gaptol. Moreover, as we discuss below, if one is willing to limit the choice of gaptol, the 
computation and refinement of eigenvalues can be done (almost) entirely in x-bit arithmetic!^ If 
y-hit arithmetic is slow, it might be best to take advantage of the faster x-bit arithmetic. And, 
as we see below as well, if not the smallest possible value is chosen for gaptol, the requirements 
the intermediate representations must fulfill are relaxed, thereby increasing the robustness of the 
method. 

Another corollary of adjusting gaptol is slightly hidden: in Line [15] of Algorithm [T] we gain 
more freedom in selecting t such that, at the next iteration, the index set Tj. splits into two or more 
subsets. For instance, when choosing r close to one end of the cluster, we are able to "back off" 
further away than usual from the end of the cluster in cases where, in a previous attempt, we did 
not find a representation satisfying the requirements [VS[. 

We cannot overemphasize the positive effects an adjusted gaptol has on robustness and parallel 
scalability. In particular, in a massively parallel computing environment, the smallest value for 
gaptol significantly improves the parallel scalability. And since many problems become embarrass- 
ingly parallel, the danger of failing to find suitable representations is entirely removed. 

Arithmetic used to approximate eigenvalues. In Lines [51 and 1 1 61 of Algorithm [Tl eigenvalues are 
respectively computed and refined to a specified relative accuracy. In both cases, we are given 
a representation, which we call My henceforth, and an index set X that indicates the eigenvalues 
that need to be approximated. When the y-bit arithmetic is much slower than the a;-bit one (say 
a factor 10 or more), the use of the latter is preferred: One creates a temporary copy of My in 
binary_x — called M^ henceforth - that is used for the eigenvalue computation in x-bit arithmetic. 
The creation of M^ corresponds to an element-wise relative perturbation of My bounded by Sx • By 
the relative robustness of the representation, 

lA^Af:,] - A,[Afy]| < krrnSx\X,[My]\. (3.4) 

For instance, bisection can be used to compute eigenvalue approximations Ai[Afr] to high relative 
accuracy, after which M^ is discard. As casting the result back to binary_y causes no additional 

^''For instance, if bisection is used to obtain initial approximations to the eigenvalues. 

^■'For the refinement of extreme eigenvalues prior to selecting shifts, we still need to resort to y-hit arithmetic. 
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error, it is Ai[Afj^] — Xi[Mx] and 

\X^[My] - A4A4]| < kMne^\\,[M^]\ , 

where ki,i is a moderate constant given by the bisection method. To first order, by the triangle 
inequality, it holds 

\K[My] - X^[My]\ < {krr + ku) ne^lA^Mj,]] . (3.5) 

Provided {krr + ^61) nSx < gaptol, by (|2.ip . a;-bit arithmetic can be used to approximate the eigen- 
values. Thus, an additional constraint on both the size n and gaptol arises: Given a gaptol, we 
must limit the matrix size up to which we do the computation purely in x-bit arithmetic. Similarly, 
for a given matrix size, we need to adjust the lower bound on gaptol in p.3p . As an example, if say 
krr < 10, kbi < 10, n < 10^, and Ex = Ed = 2~^^, it is required that that gaptol > 10~^°. When 
resorting to a;-bit arithmetic or if gaptol is chosen too small, one might respectively verify or refine 
the result of the x-bit eigenvalue computation using y-bit arithmetic without significant costs|l£| 

Requirements on the representations. As long as keignSy ^ Sx^/n, by p.2p . the residual with 
respect the Mroot is mainly influenced by the local residual. In our mixed precision approach, 
without loss of accuracy, it is possible to allow for 

keig < max \ 10, — ^ } , (3.6) 

where we assumed 10 was the original value of k^ig- As a result, the requirement on the conditional 
element growth is considerably relaxed. For instance, in the single/double and double/quadruple 
cases, assuming n < 10^, bounds on keig of about 10^ and 10^^ are sufficient, respectively, li gaptol 
is not chosen as small as possible, the bound on krr is loosened in a similar fashion: 

krr < max < 10, — ^ • gaptol > . (3.7) 

As an example, in the double/quadruple case, assuming n < 10® and gaptol set to 10~^", krr £ 10^ 
would be sufficient to ensure accuracy. 

Rayleigh quotient iteration. Our willingness to lose orthogonality up to a certain level, which is 
noticeable in the lower bound on gaptol, is also reflected in (j2.2p . As nSy/ gaptol < Sxy/n, we stop 
the RQI when 

||r('°="')|l < krs ■ gap (^X,[M]^ SxV^, (3.8) 

where krs is 0{1). In practice, we take fc^s — 1 (or even krs = 1/V^)- -'^s a consequence, the 
iteration is stopped earlier on, thereby reducing the overall work. 

As a side note: In the rare cases where RQI fails to converge (or as a general alternative to 
RQI) , we commonly resort to bisection to approximate the eigenvalue A^ and then use only one step 
of RQI (with or without applying the correction term). In the worst case, we require the eigenvalue 
to be approximated to high relative accuracy, jA; — Ai| — 0{ney\Xi\) TO]. With mixed precision, we 



^^If the first requirement in Definition 12.31 is removed, we can still make use of x-bit arithmetic although 113.51 1 
might not always be satisfied anymore. 
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relax the condition to jA^ — A^j = 0{ex\/n\Xi\ gaptol), which is less restrictive if gaptol is not chosen 
as small as possible|l£| If relgap{Xi) 3> gaptol, the restriction on the accuracy of the approximated 
eigenvalue is lifted even further [37]. 

The representation tree. Thanks to the random perturbation of the root representation and 
a properly adjusted gaptoZ-parameter, we rarely expect to see the maximum tree depth, dmax, 
exceed one. For all practical purposes, we may assume dmax < 2. As a result, the computation 
can be rearranged, as discussed in |38_ and summarized in the following: To bound the memory 
consumption, a breath- first strategy such as in Algorithm [T] is used; see for instance in [121 131). 
This means that, at any level of the representation tree, all singletons are processed before the 
clusters. A depth-first strategy would instead, process entire clusters, with the only disadvantage 
that meanwhile up to dmax representations need to be kept in memory. If dmax is limited as in 
our case, the depth-first strategy can be used without disadvantage. In fact, a depth-first strategy 
brings two advantages: (i) copying representations to and from the eigenvector matrix is avoided 
entirely (see Section 13.31 on the benefit for the mixed precision approach) and (m) if no suitable 
representation is found, there is the possibility of backtracking, that is, we process the cluster again 
by choosing different shifts at a higher level of the representation tree. For these reasons, in the 
mixed precision approach, a depth-first strategy is preferred. 

3.3. Memory cost. We stress that in our approach, both input and output are in binary_x 
format; only internally (i.e., hidden to a user) y-bit arithmetic is used. The memory management 
of an actual implementation of MRRR is affected by the fact that matrix Z e R"^'^, which on 
output contains the desired eigenvectors, is commonly used as intermediate work space. Since Z 
is in binary _x format, whenever y > x, the work space is not sufficient anymore for its customary 
use: For each index set Ir with \Xr\ > 1, a representation, M shifted, is stored in the corresponding 
columns of Z [121 [31]. As these representations consist of 2n— 1 hinaryjy numbers, this approach is 
generally not applicable. However, if we restrict to y < 2x, we can store the 2n binary_y numbers 
whenever a cluster of size four and more is encountered. Thus, the computation must be reorganized 
so that at least clusters containing less than four eigenvalues are processed without storing any data 
in Z temporarily. In fact, using a depth-first strategy, we remove the need to use Z as temporary 
storage entirely. Immediately after computing an eigenvector in binary _y, it is converted to binary _x, 
written into Z , and discarded. While our approach slightly increases the memory usage, we do not 
require much more memory: with p denoting the number of computational threads, our mixed 
precision solver still needs only 0{pn) binary_x fioating point numbers extra work space. 

4. Other practical aspects. We have implemented the mixed precision approach for three 
cases: single/double, double /extended, and double /quadruple. The first solver accepts single preci- 
sion input and produces single precision output, but internally uses (hidden to the user) double 
precision. The other two are for double precision input/output. The performance of the solvers, 
compared with the traditional implementation, depends entirely on the difference in speed between 
the two involved arithmetic. If the higher precision arithmetic is not much slower (say less than a 
factor four), the approach is expected to always work well, even for sequential executions and rela- 
tively small matrices. If the higher precision arithmetic is considerably slower, the mixed precision 
approach might still perform well for large matrices. Due to increased parallelism, our approach is 



^^Tho implied constants being the same and given by the requirement of a regular solver based on j/-bit arithmetic. 
In a similar way, we could say that the Rayleigh quotient correction does not improve the eigenvalue essentially 
anymore if |7s|/||2i|| = O{ex\\i\go,ptol/ ^/n), instead of |7s|/||2ii| = 0{sy\\i\). We never employed it as such a 
change will hardly have any effect on the computation time. 
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also expected to perform generally well on highly parallel systems. Our target application is the 
computation of a subset of eigenpairs of large-scale dense Hermitian matrices. For such a scenario, 
we tolerate a slowdown of the tridiagonal eigensolver due to mixed precisions without affecting per- 
formance significantly as the reduction to tridiagonal form is the performance bottleneck |32j [S^ • 



4.1. Our mixed precision implementations. In Section[5l we present experimental results 
of our implementations. All mixed precision implementations are based on a multi-threaded variant 
of MRRR, mrSsmp, presented in [3T1I3D], which is built on top of LAPACK's routine DSTEMR (version 
3.2). All codes use A''-representations of lower bidiagonal factorizations. Bisection is used for 
the initial eigenvalue computation if a small subset of k eigenpairs is requested or if the number 
of executing threads exceeds 12k /n [12j |3T]. If all eigenpairs are requested and the number of 
threads is less than 12, the fast sequential dqds algorithm [Ml [28] is used instead of bisection. As 
a consequence, speedups compared with the sequential execution appear less than perfect even for 
an embarrassingly parallel computation. 

We did not relax the requirements on the representations according to (|3.6p and p. 71) ; we only 
benefit from the possibility of doing so indirectly: If no suitable representation is found, the best 
candidate is chosen, which might still fulfill the relaxed requirements. 

In the following, we provide additional comments to all of the mixed precision solvers indi- 
vidually. As parameters can take a wide range of values (in particular, gaptol, but also k^r and 
keig) and several design decisions can be made, optimizing a code for performance is non-trivial 
as it generally depends on both the specific input and the architecture. While we cannot expect 
to create an "optimal" design for all input matrices and architectures, we make design decisions 
in a way that in general yields good performance. For instance, on a highly parallel machine one 
would select a small value for gaptol to increase parallelism. For testing purposes, we disabled the 
classification criterion based on the absolute gaps of the eigenvalues proposed in |35j , which might 
reduce clustering even further (it has no consequences for our test cases shown in the next section). 

Single/double. With widespread language and hardware support for double precision, the mixed 
precision approach is most easily implemented for the single/double case. In our test implemen- 
tation, we fixed gaptol to 10~^. When bisection is used, the initial eigenvalue approximation is 
done to a relative accuracy of lO^'^ • gaptol. As on most machines the double precision arithmetic 
is not more than a factor two slower than the single precision one, we carry out all computations 
in the former. Data conversion is only necessary when reading the input and writing the output. 
As a result, compared with a double precision solver using a depth- first strategy, merely a number 
of convergence criteria and thresholds must be adjusted, and the RQI must be performed using a 
temporary vector that is after convergence written into the output eigenvector matrix. The mixed 
precision code closely resembles a conventional double precision implementation of MRRR. 

Double/extended. Many current architectures have hardware support for a 80-bit extended 
floating point format (see Table [3?!]) . As the unit roundoff £e is only about three orders of magnitude 
smaller than Ed, we can improve the accuracy of MRRR by this amount. For matrices of moderate 
size, this means that the accuracy becomes comparable to that of the best methods. The main 
advantage of the extended format is that, compared with double precision, its arithmetic comes 
without any or only a small loss in speed. However, we cannot make any further adjustments in 
the algorithm, which positively effect its robustness and parallelism. We do not include results for 
the double/extended case in the next section. However, we tested the approach and experimental 
results can be found in [551 133] • 

Double/quadruple. As quadruple precision arithmetic is not widely supported by today's proces- 
sors or languages, we had to resort to software-simulated arithmetic, which is rather slow. For this 
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reason, we used double precision for the initial approximation and for the refinement of the eigen- 
values. The necessary intermediate data conversions make the mixed precision approach slightly 
more complicated to implement than the single/double one. We used the value lO"^'^ for gaptol in 
our tests. Further details can be found in l33l. 



4.2. Portability. The biggest problem of the mixed precision approach is a potential lack of 
support for the involved data types. As single and double precisions are supported by virtually 
all machines, languages, and compilers, the mixed precision approach can be incorporated to any 
linear algebra library for single precision input/output. However, for double precision input/output, 
we need to resort to either extended or quadruple precision. Not all architectures, languages, and 
compiler support these formats. For instance, the 80-bit floating point format is not supported by 
all processors. Futhermore, while the FORTRAN REAL*10 data type is a non-standard feature of 
the language and is not supported by all compilers, a C/C++ code can use the standardized long 
double data type (introduced in ISO C99) that achieves the desired result on most architectures 
that support 80-bit arithmetic. For the use of quadruple precision, there are presently two major 
drawbacks: (i) it is usually not supported in hardware, which means that one has to resort to a 
rather slow software-simulated arithmetic, and (m) the support from compilers and languages is 
rather limited. While FORTRAN has a REAL*16 data type, the quadruple precision data type 

in C/C++ is compiler-dependent: for instance, there exist the f loatl28 and _Quad data types 

for the GNU and Intel compilers, respectively. An external library implementing the software 
arithmetic might be used for portability. In all cases, the performance of quadruple arithmetic 
depends on its specific implementation. It is however likely that the hardware/software support for 
quadruple precision will be improved in the near future. 

5. Experimental Results. All tests, in this section, were run on a multi-processors system 
comprising four eight-core Intel Xeon X7550 Beckton processors, with a nominal clock speed of 
2.0 GHz. Subsequently, we refer to this machine as Beckton. We used LAPACK version 3.4.2 
and linked the library with the vendor-tuned MKL BLAS version 12.1. In addition to the results 
for LAPACK's routines and our mixed precision solvers, we also include results for mrSsmp [31} . 
All routines were compiled with Intel's compiler version 12.1 and optimization level -03 enabled. 
Although we present only results for computing all eigenpairs (LAPACK's DC does not allow 
the computation of subsets), we mention that MRRR's strength and main application lies in the 
computation of subsets of eigenpairs. 

For our tests, we used matrices of size ranging from 2,500 to 20,000 (in steps of 2,500) of six 
different types: uniform eigenvalue distribution, geometric eigenvalue distribution, 1-2-1, Clement, 
Wilkinson, and Hermite. The dimension of the Wilkinson type matrices is n + 1, as they are only 
defined for odd sizes. Details on these matrix types can be found in [23] . To help the exposition of 
the results, in the accuracy plots, the matrices are sorted by type first and then by size; vice versa, 
in the plots relative to timings, the matrices are sorted by size first and then by type. 

Figure 15.11 shows timings and accuracy for single precision inputs. As a reference, we include 
results for LAPACK's SSTEMR (MRRR) and SSTEDC (Divide & Conquer). As shown in Fig. |5.1(a)[ 



even in a sequential execution, our mixed precision approach is up to an order of magnitude faster 
than LAPACK's SSTEMR. For one type of matrices, SSTEDC is considerably faster than for all the 
others. These are the Wilkinson matrices, which represent a class of matrices that allow for heavy 
deflation within the Divide & Conquer approach. For all other matrices, which do not allow such 



extensive deflation, our solver is faster than SSTEDC. As seen in Fig. 5.1(b) in a parallel execution. 



the performance gap for the Wilkinson matrices almost entirely vanishes, while for the other matrices 
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Figure 5.1. Time and accuracy on Beckton. Timings are presented in a logarithmic scale. The largest 
residual norm and the orthogonality are measured as in HI. 11 1. The dotted black line corresponds to unit round-off 
£3. As there exist no parallel MRRR for single precision, we show timings for our mixed precision approach and 
SSTEDC only. 



our solver remains up to an order of magnitude faster than SSTEDC. As depicted in Figs. 5.1(c) (d)[ 
our routine is not only as accurate as desired but it is the most accurate one. For single precision 
input/output arguments, we obtain a solver that is more accurate and faster than the original single 
precision solver. In addition, the solver is more scalable, and more robust. In 38 out of the 48 test 
cases, SSTEMR accepted representations that did not pass the test for relative robustness, thereby 
jeopardizing the accuracy of the result. In contrast, using mixed precisions, our solver was able to 
find suitable representations in all cases. 

We now turn our attention to double precision inputs/outputs, for which timings and accuracy 
are presented in Fig. 15.21 We included the results for the multi-threaded solver mrSsmp, which in 
the sequential case is just a wrapper to DSTEMR. In general, mrSsmp obtains accuracy equivalent to 
LAPACK's DSTEMR. 



Figure 5.2(a) shows the timings for sequential executions. Our mixed precision solver is slower 



than DSTEMR, which is not a surprise, as we make use of software-simulated quadruple precision 
arithmetic. What might be a surprise is that, even with the use of such slow arithmetic, for large 
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Figure 5.2. Time and accuracy on Beckton. Timings are presented in a logarithmic scale. The largest 
residual and the orthogonality are measured as in HI. It . The dotted black line corresponds to unit round-off e^ and 
the accuracy o/ mrSsmp is equivalent to the one obtained by LAPACK's DSTEMR. 



matrices, our solver is usually as last as DSTEDC. As in the single precision case, only lor matrices 
that allow lor substantial deflation, DSTEDC is considerably faster. As Fig. 5.2(b) shows, lor a 



parallel execution, the performance difference reduces and is expected to eventually vanish as it 
does already for the a regular MRRR implementation JTT . For matrices that do not allow for 
extensive deflation, our solver is about a factor two faster than DSTEDC. 

While DSTEMR accepted in 29 out of the 48 test cases representations that did not pass the 
test for relative robustness, our mixed precision solver found suitable representations in all cases. 
In fact, lor all but the Wilkinson type matrices, we have dmax = and as a consequence: no 
danger of failing to flnd suitable representations and embarrassingly parallel computation. Even for 
Wilkinson type matrices, dmax was limited to one and clustering p was limited to 2/n. For DSTEMR, 
dmax was as high as 21 and clustering p was about 0.7 on average, which should be compared with 
the value ol about 1.6 • 10^"' lor the mixed precision solver. Therefore, we believe that our approach 
is especially well-suited lor highly parallel systems. In particular, solvers for distributed-memory 
systems should greatly bencflt from better load-balancing and reduced communication. 
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For single precision inputs [Figs. 5.1(a) - (b)| or in a parallel stetting [Fig. 5.2(b)| , our tridiagonal 
eigensolver is highly competitive in terms of execution time - often faster - compared with Divide & 
Conquer and the conventional MRRR. As a consequence, when used in context of dense Hermitian 
eigenproblems, the accuracy improvement of the tridiagonal stage carry over to the overall accuracy 
without any penalty in terms of performance. Such a behavior is illustrated by Fig. 15.31 where we 
present timings and orthogonality for dense, real symmetric input matrices. The inputs were 
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Figure 5.3. Time and orthogonality for computing all eigenpairs of dense, real symmetric matrices. Timings 
are presented in a logarithmic scale and are dominated by the reduction to tridiagonal form. 



generated by applying random orthogonal similarity transformations to the tridiagonal matrices 
of the previous experiments: A — QTQ*, with random orthogonal matrix Q G R"^". For small 
matrices in a sequential execution, our approach introduces extra overhead - see Fig. 5.2(a) Since 



the tridiagonal solver only requires 0{kn) operations to compute k eigenpairs, while the reduction to 
tridiagonal form requires 0{n^) operations, for sufficiently large matrices the overhead is completely 
negligible. Such a statement would even more apply if the matrices were complex-valued and/or 
only a .subset of eigenpairs were computed, since the reduction to tridiagonal form would carry 
even more weight relative to the tridiagonal stage. In a parallel execution, the mixed precision 
approach is competitive even for relatively small matrices [Fig. 5.3(a)| ; at the same time, the 
approach significantly improves orthogonality [Fig. 5.3(b)| . Further experiments, including subset 
computations and complex- valued inputs, can be found in [291133) . 



6. Conclusions. We presented a mixed precision variant of the MRRR algorithm, which ad- 
dresses a number potential weaknesses of MRRR such as (i) inferior accuracy compared with the 
Divide & Conquer method or the QR algorithm; [ii) the danger of not finding suitable representa- 
tions; and (iii) for distributed-memory architectures, load-balancing and communication problems 
for matrices with large clustering of the eigenvalues. Our approach provides a new perspective: 
Given input/output arguments in a binary_x floating point format, we use a higher precision bi- 
nary_y arithmetic to obtain the desired accuracy. As our analysis shows, the use of higher precision 
provides us with freedom in setting important parameters of the algorithm. In particular, we select 
these parameters to reduce the operation count, increase robustness, and improve parallelism; at 
the same time, we meet more stringent accuracy goals. Due to these changes, our mixed precision 
approach is not only as accurate as the Divide & Conquer method or the QR algorithm but - 
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under many circumstances - is also faster than these methods or even faster than a conventional 
implementation of MRRR. 

This work was mainly motivated by the results of MRRR-based eigensolvers for dense Hermitian 
problems [S^- In the context of dense eigenproblems, the tridiagonal stage is often completely 
negligible in terms of execution time: to compute k eigenpairs of a tridiagonal matrix, it only 
requires 0{kn) operations; the reduction to tridiagonal form requires 0{tt') operations and is the 
performance bottleneck. In terms of accuracy, the tridiagonal stage is responsible for most of the 
loss of orthogonality. The natural question was whether it is possible to improve the accuracy to 
the level of the best methods without sacrificing too much performance. As our results show, this 
is indeed possible. In fact, our mixed precision solver is even more accurate than the ones based 
on Divide & Conquer and QR, and remains as fast, or faster, than the classical MRRR. Finally, 
an important feature of the mixed precision approach is a considerably increased robustness and 
parallel scalability. 
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